Abstract

The Yule model and the coalescent model are two neutral stochastic models 
for generating trees in phylogenetics and population genetics, respectively.
Although these models are quite different, they lead to identical distributions
concerning the probability that pre-specified groups of taxa form monophyletic 
groups (clades) in the tree. We extend earlier work to derive exact formulae 
for the probability of finding one or more groups of taxa as clades in a rooted 
tree, or as 'clans' in an unrooted tree. Our findings are relevant for calculating 
the statistical significance of observed monophyly and reciprocal monophyly in 
phylogenetics. 
1. Introduction



When gene trees are estimated from multiple lineages taken from two or 
more populations, there is an increased chance that lineages within each popu- 
lation form monophyletic groups compared to sampling multiple lineages from a 
single population. This observation has led to the adoption of a null hypothesis 
that a set of lineages belongs to a single population or taxonomic group, in ask- 
ing whether a particular group of lineages came from a taxonomically distinct 
population 14 , ^ . Statistical tests for reciprocal monophyly between two sister 
taxa can then be developed to test against this null hypothesis [1, [13] • 

Reciprocal monophyly is central to the genealogical species concept. Accord- 
ing to this concept two groups come from different species if they form distinct 
monophyletic groups [^, |3| • Gene trees from lineages sampled from one or more 
populations are typically estimated, and monophyly (or lack of monophyly) of 
these groups can be observed from the clades of the gene tree. Statistical tests 
for whether observed levels of monophyly provide sufficient evidence to conclude 
that a group is taxonomically distinct can be performed, given a probabilistic 
model for the clades on a tree [13] . 

Two neutral models - involving different evolutionary scales - are useful
in this context. The Yule (pure birth, or birth-death) model describes the
speciation (and extinction) of lineages at the species level as one moves forward
in time, while Kingman's coalescent process is a population genetic model that
traces the ancestry of individual lineages back in time as they coalesce (and
thereby form a tree). These are two quite different processes and lead to different
branch lengths on trees; remarkably, however, they generate identical distributions
of tree topologies [1]. Thus, while the coalescent process is a natural model for
trees in single populations, the equivalence of the Yule and coalescent models
for tree topologies means that results for the Yule model can be exploited in
studying probabilities of clades for coalescent trees in single populations.

Although there has been an emphasis on testing for the taxonomic distinctiveness
of one group of lineages, joint probabilities of clades could be used to
examine whether the observed monophyly of several groups is statistically
significant using a single test. Such an omnibus test of the null hypothesis that all
groups come from one population might be more powerful than testing several
groups one at a time.

In this note, we derive exact formulae for the joint probabilities of k clades
for a random Yule/coalescent gene tree under the conditions that the k clades
are mutually exclusive (they have no leaves of the gene tree in common), and are
either exhaustive (all leaves of the gene tree occur in one of the k clades), or form
only a subset of the leaves of the gene tree. These results generalize results from
[12], which provided an explicit formula for the probability that two mutually
exclusive and exhaustive sets of leaves formed clades on a Yule/coalescent gene
tree.

In addition, we extend the results to unrooted trees by giving the probabilities
of 'clans' (sets of leaves that are all on one side of a split [17]), as well as
the joint probability of k > 1 clans, on Yule/coalescent trees which have been
unrooted. This extension is relevant when only unrooted trees can be estimated,
which is particularly common in prokaryotic evolution [10].



2. Clades 

Throughout this paper we will let $X_n$ (or, more briefly, $X$) denote a set of
taxa of size $n$. Given a rooted phylogenetic $X$-tree $T_X$ (more briefly $T$), with
leaf set $X = X_n$, a clade of $T$ is a subset of $X$ that corresponds to the set of
leaves that are descended from any internal vertex. For example, in Fig. 1(a),
the sets {3,4} and {1,2,3,4} are two clades. Any two clades $A$ and $B$ of $T$
satisfy the following compatibility condition:

$$A \cap B \in \{A, B, \emptyset\}. \tag{1}$$

This is equivalent to requiring that $A = B$, or one set is a strict subset of the
other, or the two sets are disjoint.

We will let $c(T)$ denote the set of clades of $T$, and say that a clade is proper
if it is a strict subset of $X$. Notice that a rooted phylogenetic $X$-tree has
at most $2n - 1$ clades, and it has precisely this number if and only if the tree is
binary, that is, if each non-leaf vertex has two descendant vertices.








Figure 1: (a) This rooted tree has 13 'clades', including the three sets circled 
({1, 2}, {1, 2, 3, 4}, {6, 7}). In this tree {1, 2} and {3, 4} are sister clades, but {1, 2} and {6, 7} 
are not. (b) The unrooted tree $T^{-\rho}$ obtained from the tree $T$ in (a) by suppressing the root
vertex $\rho$. This tree has {3, 4, 5, 6, 7} as a 'clan', even though this set is not a clade of $T$.



3. The Yule-Harding-Kingman process 

Consider the probability distribution on binary phylogenetic $X$-trees described
by a model that grows a tree by selecting a leaf uniformly at random
and 'splitting' it into two new leaves, as illustrated in Fig. 2. Since we are
ignoring branch lengths in this paper and concentrating just on tree topologies, the
resulting probability distribution on rooted binary tree topologies is the same
as that given by any (stationary or non-stationary) birth-death process on trees
in which birth (speciation) and death (extinction) events apply exchangeably
to all the species extant at any given moment (see [1] for further details). This
is useful, since the rates of speciation and extinction throughout time may be
both time-dependent and variable according to the number of taxa present [11].

The study of such pure-birth trees was initiated in Yule's 1925 paper [18],
and the probability distribution on tree topologies (without reference to branch 
lengths) was further studied by Harding [y]. Moreover, this probability distri- 
bution on trees is precisely the same as that given by a quite different process, 
namely Kingman's coalescent process Q in population genetics, which starts 
at the leaves and successively combines pairs of elements, provided that, once 
again, we ignore branch lengths ([H). 

To emphasize this equivalence between a model in macro-evolution (specia- 
tion and extinction) and micro-evolution (population genetics) we will refer to 
it as the Yule-Harding-Kingman (YHK) process for generating tree topologies. 

We will also refer to a random binary phylogenetic $X$-tree produced by
any of these stochastically equivalent processes as $\mathcal{T}_X$ (or often just
$\mathcal{T}$ if $X$ is clear), and so $P(\mathcal{T}_X = T)$ is the probability that $T$
is the actual phylogenetic $X$-tree produced by the process. The process, viewed
as a pure-birth model, is illustrated in Fig. 2.

In this paper, we exploit two important properties of the process that generates
$\mathcal{T}$. First we recall some notation that will be used throughout: for any
phylogenetic $X$-tree $T$ and any non-empty subset $Y$ of $X$, let $T|_Y$ be the
phylogenetic tree induced by restricting the leaf set to $Y$ (as in [15]). The two
properties that the YHK process enjoys, and which we will exploit throughout
this paper, are the following:

(EP) If $T'$ is obtained from $T$ by permuting its leaves, then

$$P(\mathcal{T} = T') = P(\mathcal{T} = T).$$

(GE) For any proper (and non-empty) subset $A$ of $X$, and any rooted binary
phylogenetic tree $T$ with leaf set $X - A$:

$$P(\mathcal{T}_X|_{X-A} = T \mid A \in c(\mathcal{T}_X)) = P(\mathcal{T}_{X-A} = T).$$

Property (EP) is the Exchangeability property [1], which requires that the
probability of a particular phylogenetic tree depends just on its shape and not on
how its leaves are labeled (it is called 'label-invariance' in [3]). Property (GE)
is the Group Elimination property from [1]; it states that, conditional on $A$
forming a clade in the tree, the tree structure on the remaining taxa is also
described by the YHK process. In turn (GE) implies the following Sampling
Consistency property ([1]): For any rooted binary tree $T$ with leaf set $A \subseteq X$,
we have:

(SC) $$P(\mathcal{T}_X|_A = T) = P(\mathcal{T}_A = T).$$

Figure 2: From a rooted binary tree on three leaves (a), splitting the right leaf (*) leads to a 
'balanced' tree shape (b), while splitting either of the other two leaves produces an unbalanced 
tree (c). Thus the balanced tree shape has probability 1/3 and as there are three distinct ways 
to label the leaves, each of these rooted binary phylogenetic trees has probability 1/9 under 
the YHK process. For a phylogenetic tree of shape (c), the probability is 1/18. 



To see that (GE) implies (SC), one sequentially deletes leaves that are not in 
A, noting that each leaf is, trivially, a clade in any tree. 

4. Clade probabilities under the YHK process 

The following result is stated and established in the appendix of ; it is also 
stated and proved in [l^] (Theorem 4.4), and in Q (Proposition 2). A further 
proof of this result is also possible based on induction on n and using the well- 
known property of the YHK model that the number of leaves in one of the 
(randomly selected) maximal subtrees of $\mathcal{T}_X$ is uniformly distributed between 1
and n — 1. 

Lemma 4.1. Let $X_n(a)$ be the number of proper clades of size $a$ in $\mathcal{T}_X$. Then

$$E[X_n(a)] = \frac{2n}{a(a+1)}, \quad 1 \le a \le n-1.$$
For a subset $A$ of $X$, let $p_n(A)$ be the probability that $A$ is a proper clade
of $\mathcal{T}_X$. From (EP) it is clear that this probability depends only on $a = |A|$ and
$n$, and so we can write $p_n(a)$ for this probability. From [12] we have:

Lemma 4.2.

$$p_n(a) = \begin{cases} \dfrac{2n}{a(a+1)} \dbinom{n}{a}^{-1}, & \text{if } 1 \le a \le n-1; \\ 0, & \text{otherwise.} \end{cases}$$



The proof of this result from [12] relies on a combinatorial identity to sum a
series. Here we point out how Lemma 4.2 follows very directly from Lemma 4.1.

Proof of Lemma 4.2: For $1 \le a \le n-1$, the exchangeability property (EP)
implies that:

$$p_n(A) = \sum_{k \ge 0} P(\mathcal{T} \text{ has } k \text{ clades of size } a) \cdot \frac{k}{\binom{n}{a}} = \frac{E[X_n(a)]}{\binom{n}{a}},$$

where $X_n(a)$ is as defined in Lemma 4.1. This completes the proof. □
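As a small numerical companion to Lemmas 4.1 and 4.2, the following sketch evaluates both formulae with exact rational arithmetic. The function names `expected_clades` and `clade_prob` are illustrative choices, not notation from the paper.

```python
from fractions import Fraction
from math import comb


def expected_clades(n, a):
    """E[X_n(a)] from Lemma 4.1: expected number of proper clades of size a."""
    if not 1 <= a <= n - 1:
        return Fraction(0)
    return Fraction(2 * n, a * (a + 1))


def clade_prob(n, a):
    """p_n(a) from Lemma 4.2: probability that a fixed set of size a is a proper clade."""
    return expected_clades(n, a) / comb(n, a)


print(clade_prob(6, 1))  # 1, since a single leaf is always a clade
print(clade_prob(6, 2))  # 2/15
```

One sanity check on Lemma 4.1: summing E[X_n(a)] over a = 1, ..., n-1 gives 2n - 2, the number of proper clades of any rooted binary tree on n leaves.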

4.1. Pairs of clades 

For a pair $A, B$ of disjoint subsets of $X$, let $p_n(A, B)$ be the probability that
$A$ and $B$ are sister clades of $\mathcal{T}_X$ (i.e. $A$, $B$ and $A \cup B$ are clades of
$\mathcal{T}_X$). By exchangeability (EP), this probability depends on $a = |A|$, $b = |B|$
and $n$ only, and so we will denote it $p_n(a, b)$.

Consider first the special case where $n = a + b$; that is, $A$ and $X - A$ are
sister clades, which is equivalent to saying that $A$ is a maximal proper clade.
From (Equation 6) (see also [13]), the probability of this event is given as
follows:

Lemma 4.3. For $1 \le a < n$, we have:

$$p_n(a, n-a) = \frac{2}{n-1} \binom{n}{a}^{-1}.$$

We generalize this slightly as follows:

Lemma 4.4. Let $k = a + b < n$. Then:

$$p_n(a, b) = \frac{4\, a!\, b!\, (n-k)!}{(n-1)!\; k\,(k^2 - 1)}.$$
Proof:

$$p_n(A, B) = P(A \cup B \in c(\mathcal{T}_X)) \cdot P(A, B \in c(\mathcal{T}_X|_{A \cup B}) \mid A \cup B \in c(\mathcal{T}_X)).$$

Applying Lemma 4.2 to the first term, and property (SC) and Lemma 4.3 to
the second term we have:

$$p_n(A, B) = \frac{2n}{(a+b)(a+b+1)} \binom{n}{a+b}^{-1} \cdot \frac{2}{a+b-1} \binom{a+b}{a}^{-1},$$

from which the result follows. □
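The factorisation used in the proof of Lemma 4.4 can be checked numerically. The sketch below compares the closed form with the product of the Lemma 4.2 and Lemma 4.3 terms, using exact fractions; the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def clade_prob(n, a):
    """p_n(a), Lemma 4.2 (proper clade of size a)."""
    if not 1 <= a <= n - 1:
        return Fraction(0)
    return Fraction(2 * n, a * (a + 1)) / comb(n, a)


def root_split_prob(n, a):
    """p_n(a, n - a), Lemma 4.3: A and X - A are the two maximal clades."""
    return Fraction(2, n - 1) / comb(n, a)


def sister_prob(n, a, b):
    """p_n(a, b), Lemma 4.4 closed form; assumes a + b < n."""
    k = a + b
    return Fraction(4 * factorial(a) * factorial(b) * factorial(n - k),
                    factorial(n - 1) * k * (k * k - 1))


# Closed form versus the two-factor product from the proof
n, a, b = 6, 2, 2
print(sister_prob(n, a, b))                                 # 1/225
print(clade_prob(n, a + b) * root_split_prob(a + b, a))     # 1/225
```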

Now, for any two arbitrary subsets $A, B$ of $X_n = \{1, \ldots, n\}$, let $P_n(A, B)$
be the probability that a Yule tree $\mathcal{T}$ on $X_n$ has $A$ and $B$ as proper clades. As
usual, let $a = |A|$ and $b = |B|$.

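Theorem 4.5.

$$P_n(A, B) = \begin{cases}
p_n(a), & \text{if } A = B \quad \text{[case 1]}; \\
R_n(a, b), & \text{if } A \subset B \quad \text{[case 2]}; \\
R_n(b, a), & \text{if } B \subset A \quad \text{[case 3]}; \\
p_n(a, n-a), & \text{if } A \cap B = \emptyset,\ A \cup B = X_n \quad \text{[case 4]}; \\
r_n(a, b), & \text{if } A \cap B = \emptyset,\ A \cup B \subset X_n \quad \text{[case 5]}; \\
0, & \text{otherwise} \quad \text{[case 6]};
\end{cases}$$

where $p_n(a)$ and $p_n(a, n-a)$ are given by Lemmas 4.2 and 4.3,

$$R_n(a, b) := \frac{4n}{a(a+1)(b+1)} \binom{n}{b}^{-1} \binom{b}{a}^{-1},$$

$$r_n(a, b) := \frac{4\, a!\, b!\, (n-a-b)!}{(n-1)!}\, G_n(a, b),$$

and where

$$G_n(a, b) := \frac{n}{ab(a+1)(b+1)} - \frac{a(a+1) + b(b+1) + ab}{ab(a+1)(b+1)(a+b+1)} + \frac{1}{(a+b)((a+b)^2 - 1)}.$$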



Proof: Cases 1 and 4 are given by Lemmas 4.2 and 4.3 respectively. For the
second case ($A \subset B$), we have:

$$P_n(A, B) = P(A \in c(\mathcal{T}_X) \mid B \in c(\mathcal{T}_X)) \cdot P(B \in c(\mathcal{T}_X)).$$

Since $A \subset B$ we can apply property (SC) and Lemma 4.2 to deduce that the
first term in this product is $p_b(a)$ while the second term is $p_n(b)$,
from which the result follows. Case 3 follows by an analogous argument. For
Case 5, consider the following two pairs of events:

• $E_1$: $A, B \in c(\mathcal{T}_X)$,

• $E_2$: $A \cup B, B \in c(\mathcal{T}_X)$,

• $F_1$: $A \in c(\mathcal{T}_X|_{X-B})$,

• $F_2$: $B \in c(\mathcal{T}_X)$.

We are interested in computing $P(E_1)$ since this is $P_n(A, B)$, and by the principle
of inclusion and exclusion we have:

$$P(E_1) = P(E_1 \cup E_2) + P(E_1 \cap E_2) - P(E_2). \tag{2}$$

Now, $E_1 \cup E_2$ occurs precisely if $F_1 \cap F_2$ occurs (since $E_1 \cup E_2$ is the event that
$B \in c(\mathcal{T}_X)$ and either $A \in c(\mathcal{T}_X)$ or $A \cup B \in c(\mathcal{T}_X)$). Thus:

$$P(E_1 \cup E_2) = P(F_1 \mid F_2) \cdot P(F_2).$$

Combining this equation with (2) and noting that $P(E_1 \cap E_2) = p_n(a, b)$ (the
sister-clade probability of Lemma 4.4) and $P_n(A, B) = P(E_1)$, we obtain:

$$P_n(A, B) = P(E_1) = P(F_1 \mid F_2) \cdot P(F_2) - P(E_2) + p_n(a, b). \tag{3}$$

Now, by (GE),

$$P(F_1 \mid F_2) = P(A \in c(\mathcal{T}_{X-B})) = p_{n-b}(a), \tag{4}$$

and

$$P(E_2) = P(A \cup B \in c(\mathcal{T}_X)) \cdot P(B \in c(\mathcal{T}_X) \mid A \cup B \in c(\mathcal{T}_X)) = p_n(a+b) \cdot p_{a+b}(b). \tag{5}$$

Thus, substituting (4) and (5) and the equality $P(F_2) = p_n(b)$ into (3), we
obtain:

$$P_n(A, B) = p_{n-b}(a) \cdot p_n(b) - p_n(a+b) \cdot p_{a+b}(b) + p_n(a, b).$$

Case 5 now follows from Lemmas 4.2 and 4.4. Case 6 follows from the compatibility
condition (1) for clades. □
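The Case 5 argument ends with a three-term expression that should agree with the closed form $r_n(a,b)$ built from $G_n(a,b)$. The following sketch checks that agreement with exact fractions; the function names are illustrative, and the closed form used is the one reconstructed from the three-term expression in the proof.

```python
from fractions import Fraction
from math import comb, factorial


def p1(n, a):
    """p_n(a), Lemma 4.2."""
    if not 1 <= a <= n - 1:
        return Fraction(0)
    return Fraction(2 * n, a * (a + 1)) / comb(n, a)


def sister(n, a, b):
    """p_n(a, b), Lemma 4.4 (requires a + b < n)."""
    k = a + b
    return Fraction(4 * factorial(a) * factorial(b) * factorial(n - k),
                    factorial(n - 1) * k * (k * k - 1))


def r_closed(n, a, b):
    """r_n(a, b) via G_n(a, b): disjoint clades A, B with A u B a strict subset of X."""
    s = a + b
    g = (Fraction(n, a * b * (a + 1) * (b + 1))
         - Fraction(a * (a + 1) + b * (b + 1) + a * b,
                    a * b * (a + 1) * (b + 1) * (s + 1))
         + Fraction(1, s * (s * s - 1)))
    return Fraction(4 * factorial(a) * factorial(b) * factorial(n - s),
                    factorial(n - 1)) * g


def r_from_proof(n, a, b):
    """The same probability assembled from the three terms in the proof of Case 5."""
    return p1(n - b, a) * p1(n, b) - p1(n, a + b) * p1(a + b, b) + sister(n, a, b)


print(r_closed(6, 2, 2))      # 17/675
print(r_from_proof(6, 2, 2))  # 17/675
```

A quick plausibility check: for a = b = 1 the probability is 1, since single leaves are always clades.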

We now ask whether the events '$A$ is a clade' and '$B$ is a clade' are positively
or negatively correlated under the YHK process. Let $X_A$ (respectively $X_B$) be
the Bernoulli (0,1) random variable that takes the value 1 if $A$ (respectively
$B$) is a clade of a YHK tree $\mathcal{T}$ on $X_n$, and let $\rho_n(A, B)$ denote the correlation
coefficient of these two random variables, which is given by:

$$\rho_n(A, B) = \frac{P_n(A, B) - p_n(A)\, p_n(B)}{\sqrt{p_n(A)(1 - p_n(A))\, p_n(B)(1 - p_n(B))}},$$

where $P_n(A, B)$ is the joint probability from Theorem 4.5.
Corollary 4.6. For any two strict subsets $A, B$ of $X$, the correlation $\rho_n(A, B)$ is:

• strictly negative, if $A, B$ are not compatible, and undefined if $|A| = 1$ or
$|B| = 1$.

• strictly positive, otherwise.

Proof: If $A$ and $B$ are not compatible, then $P_n(A, B) = 0$ but both $p_n(A)$ and
$p_n(B)$ are greater than zero, and so $\rho_n(A, B) < 0$. If $|A| = 1$ then $p_n(A) = 1$
and $P_n(A, B) = p_n(B)$ (regardless of whether $A$ is a subset of $B$ or is disjoint
from $B$). Thus the numerator and denominator of $\rho_n(A, B)$ are both zero. A
similar argument holds if $|B| = 1$.

In the remaining cases, we consider the ratio $P_n(A, B) / (p_n(A)\, p_n(B))$. For
example, in Case 2, we have:

$$\frac{P_n(A, B)}{p_n(A) \cdot p_n(B)} = \frac{(n-1) \cdots (n-a+1)}{(b-1) \cdots (b-a+1)}.$$

This is strictly greater than 1 since $\frac{n-1}{b-1} > 1, \ldots, \frac{n-a+1}{b-a+1} > 1$.
Similar arguments apply in the other cases; however Case 5 requires some detailed
algebraic manipulation. □
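To make the correlation coefficient concrete, here is a minimal sketch that evaluates it in Cases 2, 4 and 5 of Theorem 4.5. The function names and the integer `case` argument are assumptions made for this illustration; the joint probabilities are assembled from the product and three-term expressions appearing in the proof of Theorem 4.5.

```python
from fractions import Fraction
from math import comb, factorial, sqrt


def p1(n, a):
    """p_n(a), Lemma 4.2."""
    if not 1 <= a <= n - 1:
        return Fraction(0)
    return Fraction(2 * n, a * (a + 1)) / comb(n, a)


def joint(n, a, b, case):
    """Joint clade probability for cases 2, 4 and 5 of Theorem 4.5."""
    if case == 2:   # A strictly inside B
        return p1(n, b) * p1(b, a)
    if case == 4:   # A, B disjoint and A u B = X (needs a + b == n)
        return Fraction(2, n - 1) / comb(n, a)
    if case == 5:   # A, B disjoint and A u B a strict subset of X
        s = a + b
        sister = Fraction(4 * factorial(a) * factorial(b) * factorial(n - s),
                          factorial(n - 1) * s * (s * s - 1))
        return p1(n - b, a) * p1(n, b) - p1(n, s) * p1(s, b) + sister
    raise ValueError("case must be 2, 4 or 5")


def rho(n, a, b, case):
    """Correlation of the indicator variables 'A is a clade' and 'B is a clade'."""
    pa, pb = p1(n, a), p1(n, b)
    num = joint(n, a, b, case) - pa * pb
    return float(num) / sqrt(float(pa * (1 - pa) * pb * (1 - pb)))


for case, (a, b) in {2: (2, 10), 4: (10, 15), 5: (2, 10)}.items():
    print(case, rho(25, a, b, case))
```

The sizes passed in must satisfy the constraints of the chosen case (for instance a < b in Case 2), and both must be at least 2 so that the denominator is non-zero.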

Fig. 3 illustrates the correlation coefficient $\rho_n(A, B)$ for $n = 25$ in Cases
2, 4 and 5. Notice that the correlation is typically much smaller in Cases 2 and
5 than for Case 4.

Figure 3: Graphs of $\rho_n(A, B)$ for $n = 25$, in Cases 2, 4 and 5, with $a = |A|$ and $b = |B|$.



5. Extension to partitions of X. 

Suppose that the collection of sets $A_1, A_2, \ldots, A_k$ forms a partition of $X$,
and let $a_i = |A_i|$, for $i = 1, \ldots, k$, so that $n = |X| = \sum_{i=1}^{k} a_i$. For a rooted
YHK tree $\mathcal{T}$, let $p(a_1, \ldots, a_k)$ be the probability that $A_1, A_2, \ldots, A_k$ are clades
of $\mathcal{T}$ (this probability depends only on the cardinality of the sets by the
exchangeability property). For example, $p(2,2,2) = 2/225$, and from Lemma 4.3
we have: $p(a_1, a_2) = \frac{2}{a_1 + a_2 - 1} \binom{a_1 + a_2}{a_1}^{-1}$. Our aim in this section is to
generalize this to larger values of $k$. In order to do so, we describe a new result
for the Yule model, which requires a further definition.

For a rooted YHK tree $\mathcal{T}$, and a rooted phylogenetic tree $T_k$ with leaf set
$\{1, \ldots, k\}$, let $p(a_1, \ldots, a_k; T_k)$ be the probability that $A_1, A_2, \ldots, A_k$ are clades
of $\mathcal{T}$ and that $T_k$ is the tree obtained from $\mathcal{T}$ by replacing each clade $A_i$ by
a single leaf labelled $i$, for $i = 1, \ldots, k$. Let $I(T_k)$ denote the set of interior
vertices of $T_k$.

Theorem 5.1. For $k > 1$, we have:

(i)
$$p(a_1, \ldots, a_k; T_k) = \frac{2^{k-1}}{n!} \prod_{i=1}^{k} a_i! \prod_{v \in I(T_k)} \Big( \sum_{i=1}^{k} a_i I_v(A_i) - 1 \Big)^{-1},$$

where $I_v(A_i)$ is the indicator variable that takes the value of 1 if $i$ lies
below $v$ in $T_k$ and 0 otherwise.

(ii)
$$p(a_1, \ldots, a_k) = \sum_{T_k} p(a_1, \ldots, a_k; T_k),$$

where the summation is over all distinct rooted binary phylogenetic trees
on leaf set $\{1, \ldots, k\}$.



Proof: We prove the result by induction on $k$. For $k = 2$, Lemma 4.3 gives
$p(a_1, a_2; T_2) = p_n(a_1, a_2) = \frac{2}{n-1} \binom{n}{a_1}^{-1}$, where $n = a_1 + a_2$, which agrees with
the expression given in part (i) with $k = 2$.

Now suppose that part (i) holds whenever $k$ is less than or equal to $m \ge 2$;
we will show that it also holds when $k = m + 1$. Thus, suppose we have
a collection $C = \{A_1, \ldots, A_{m+1}\}$ that partitions $X$, and also have a rooted
binary phylogenetic tree $T_{m+1}$ on leaf set $\{1, \ldots, m+1\}$. Then $T_{m+1}$ has a
cherry (two leaves adjacent to the same vertex). Without loss of generality (by
re-ordering the sets if necessary), we may suppose that these two leaves are $m$
and $m + 1$. Consider the collection of $m$ sets obtained from $C$ by replacing
$A_m$ and $A_{m+1}$ by their union, and let $T'$ be the tree obtained from $T_{m+1}$ by
deleting the leaves $m$ and $m + 1$ along with their incident edges and labelling the
exposed vertex by $m$. Notice that $T'$ is a rooted binary phylogenetic tree that has
leaf set $\{1, \ldots, m\}$. By the exchangeability and group elimination (via sampling
consistency) properties we have, for $a'_m := a_m + a_{m+1}$, the following identity:

$$p(a_1, \ldots, a_{m+1}; T_{m+1}) = p(a_1, \ldots, a'_m; T') \cdot p_{a'_m}(a_m, a_{m+1}),$$

where $p_{a'_m}(a_m, a_{m+1})$ is the probability that a Yule tree on leaf set $A_m \cup A_{m+1}$
has $A_m$ and $A_{m+1}$ as sister (and thus maximal) clades. Applying the induction
hypothesis for the first term on the right-hand side of this equation, namely
$p(a_1, \ldots, a'_m; T')$, and applying Lemma 4.3 for the second term, and collecting
terms, leads to the expression in part (i) for $k = m + 1$ and thereby justifies the
induction step.

Part (ii) follows by observing that each tree $\mathcal{T}$ that has $A_1, \ldots, A_k$ as clades
has one (and only one) associated tree $T_k$, and so these trees provide a partition
of the event for which the probability is given by $p(a_1, \ldots, a_k)$. □

As an illustration of Theorem 5.1, we have the following result for $k = 3$:

$$p(a_1, a_2, a_3) = \frac{4\, a_1!\, a_2!\, a_3!}{n!\,(n-1)} \left[ \sum_{i=1}^{3} \frac{1}{n - a_i - 1} \right],$$

where $n = a_1 + a_2 + a_3$.
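For general k, the sum over trees $T_k$ in a Theorem 5.1-style formula can be organised recursively over the root bipartition of the blocks. The sketch below assumes the per-tree weight is $\frac{2^{k-1}}{n!}\prod_i a_i! \prod_v (n_v - 1)^{-1}$, with $n_v$ the total size of the blocks below interior vertex $v$; the function names `tree_sum` and `joint_clade_prob` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial, prod


def tree_sum(sizes):
    """Sum, over rooted binary trees on the given blocks, of prod_v 1/(n_v - 1)."""
    sizes = tuple(sizes)
    k = len(sizes)
    if k == 1:
        return Fraction(1)
    total = Fraction(0)
    others = range(1, k)
    # Keep block 0 on the "left" side so each unordered root split is counted once.
    for r in range(0, k - 1):
        for extra in combinations(others, r):
            left = [sizes[0]] + [sizes[i] for i in extra]
            right = [sizes[i] for i in others if i not in extra]
            total += tree_sum(left) * tree_sum(right)
    return total / (sum(sizes) - 1)


def joint_clade_prob(sizes):
    """p(a_1, ..., a_k): all blocks of a partition of X are clades."""
    k, n = len(sizes), sum(sizes)
    scale = Fraction(2 ** (k - 1) * prod(factorial(a) for a in sizes), factorial(n))
    return scale * tree_sum(sizes)


print(joint_clade_prob((2, 2, 2)))  # 2/225
```

The enumeration grows quickly with k, so this is meant for small partitions only.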

We note that, as well as being a generalization of Lemma 4.3 to $k > 2$,
Theorem 5.1(i) also generalizes the classic result that the probability that a
YHK tree $\mathcal{T}$ has a given tree topology $T_k$ is $\frac{2^{n-1}}{n!} \prod_{v \in I(T_k)} (n_v - 1)^{-1}$, where $n_v$ is
the number of leaves of Tk below v (see ^ or [lH). This can be seen by setting 
$a_1 = a_2 = \cdots = a_k = 1$ in Theorem 5.1(i).
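The classic topology formula can be illustrated on the four-leaf trees of Fig. 2. The following sketch represents a rooted binary tree as nested pairs (an encoding chosen here purely for illustration) and evaluates $\frac{2^{n-1}}{n!} \prod_v (n_v - 1)^{-1}$.

```python
from fractions import Fraction
from math import factorial


def leaves_and_weight(tree):
    """Return (number of leaves, product over interior vertices of 1/(n_v - 1))."""
    if not isinstance(tree, tuple):
        return 1, Fraction(1)          # a leaf contributes no factor
    n_left, w_left = leaves_and_weight(tree[0])
    n_right, w_right = leaves_and_weight(tree[1])
    n = n_left + n_right
    return n, w_left * w_right / (n - 1)


def topology_prob(tree):
    """Probability of a labelled rooted binary tree under the YHK process."""
    n, weight = leaves_and_weight(tree)
    return Fraction(2 ** (n - 1), factorial(n)) * weight


print(topology_prob(((1, 2), (3, 4))))    # 1/9, the balanced tree
print(topology_prob((((1, 2), 3), 4)))    # 1/18, the unbalanced tree
```

Both values agree with those quoted in the caption of Figure 2.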
6. Extension to unrooted trees



If we suppress the root $\rho$ of a rooted binary phylogenetic $X$-tree $T$, we
obtain an unrooted binary phylogenetic $X$-tree, which we will denote as $T^{-\rho}$
(as shown in Fig. 1(b)). Following [17] (see also [10]), we say that a subset $A$
of $X$ is a clan of an unrooted phylogenetic $X$-tree $T'$ if $A | X - A$ is a split of
$T'$. Note that any clade of the rooted tree $T$ becomes a clan of $T^{-\rho}$. However,
this latter tree also has additional clans that do not correspond to a clade of $T$.
The precise relationship is given as follows:

Lemma 6.1. Given a rooted binary $X$-tree $T$, a set $A$ is a clan of $T^{-\rho}$ if and
only if either $A$ is a clade of $T$ or $X - A$ is a clade of $T$.

Now suppose the rooted phylogenetic tree $\mathcal{T}$ is generated under the YHK
process. Then we obtain an induced probability for the unrooted tree $\mathcal{T}^{-\rho}$. Note
that the same unrooted tree can arise from different rootings. This probability
distribution on unrooted phylogenetic trees can also be described directly as a
Yule-type process on unrooted trees in which, at each stage, a leaf is selected
uniformly at random and a new leaf (with a random label) is attached to its
incident edge (see e.g. [16]). Fig. 4 illustrates how different leaf choices in this
process lead to different shapes of unrooted trees.

For a strict non-empty subset $A$ of $X_n$, let $q_n(A)$ be the probability that $A$
is a clan of the unrooted YHK tree on leaf set $X_n$; by (EP) this depends only
on $a = |A|$ and $n$ so we will also write it as $q_n(a)$.

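Lemma 6.2.

$$q_n(a) = 2n \left[ \frac{1}{a(a+1)} + \frac{1}{b(b+1)} - \frac{1}{(n-1)n} \right] \binom{n}{a}^{-1},$$

where $a = |A|$, $b = n - a$.

Proof: By Lemma 6.1 we have:

$$q_n(A) = p_n(A) + p_n(X - A) - P_n(A, X - A).$$

Applying Lemmas 4.2 and 4.3, noting that the joint probability $P_n(A, X - A)$ that
both $A$ and $X - A$ are clades equals $p_n(a, n - a)$, leads to the claimed equation. □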
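The inclusion-exclusion step in this proof can be confirmed with a few lines of exact arithmetic. In the sketch below, `clan_prob` implements the closed form (including the binomial factor that the derivation requires) and `clan_prob_from_clades` rebuilds it from Lemmas 4.2 and 4.3; both names are illustrative.

```python
from fractions import Fraction
from math import comb


def clade_prob(n, a):
    """p_n(a), Lemma 4.2."""
    if not 1 <= a <= n - 1:
        return Fraction(0)
    return Fraction(2 * n, a * (a + 1)) / comb(n, a)


def root_split_prob(n, a):
    """p_n(a, n - a), Lemma 4.3."""
    return Fraction(2, n - 1) / comb(n, a)


def clan_prob(n, a):
    """q_n(a): a fixed set of size a is a clan of the unrooted YHK tree."""
    b = n - a
    bracket = Fraction(1, a * (a + 1)) + Fraction(1, b * (b + 1)) - Fraction(1, (n - 1) * n)
    return 2 * n * bracket / comb(n, a)


def clan_prob_from_clades(n, a):
    """Same quantity via q = p(A) + p(X - A) - P(A and X - A both clades)."""
    return clade_prob(n, a) + clade_prob(n, n - a) - root_split_prob(n, a)


print(clan_prob(6, 2))  # 11/75
print(clan_prob(5, 2))  # 1/5
```

The second value matches a direct count: every unrooted binary tree on five leaves has exactly two cherries, so a fixed pair is a cherry with probability 2/10.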

Now consider two disjoint subsets $A$ and $B$ of $X$, and let $q_n(A, B)$ be the
probability that $A$ and $B$ are both clans of the unrooted YHK tree on leaf set
$X_n$. By (EP), this probability depends only on $a = |A|$, $b = |B|$ and $n$, and so
we will denote it as $q_n(a, b)$. As an example, we have:

$$q_6(2, 2) = 7/225.$$

To see this, observe that if we take (say) $A = \{1, 2\}$, $B = \{3, 4\}$ then, referring
to Fig. 4, there is just one tree of shape (b) and two of shape (c) that have both
clans $A$ and $B$. Thus, $q_6(2,2) = 1 \times \frac{1}{75} + 2 \times \frac{2}{225}$. We now give an exact analytical
formula for $q_n(a, b)$.




Figure 4: Only one unrooted binary tree shape is possible with five leaves (a), but two are 
possible with six leaves (b, c). If the 'central' leaf (*) of tree a is split to form two leaves, then 
we obtain tree shape (b), while splitting any one of the remaining four leaves produces tree 
shape (c). Thus, tree shape (b) has probability 1/5. Since there are $6!/(3!\,2^3) = 15$ distinct ways
to label its leaves, each of the resulting phylogenetic trees has probability 1/75. By contrast,
any phylogenetic tree of shape (c) has probability $4/5 \times 1/90 = 2/225$.



Theorem 6.3.

(i) If $a + b = n$, then:

$$q_n(a, b) = q_{a+b}(A) = \frac{2\, a!\, b!}{(a+b-1)!} \left[ \frac{1}{a(a+1)} + \frac{1}{b(b+1)} - \frac{1}{(a+b)(a+b-1)} \right].$$

(ii) If $a + b < n$ then:

$$q_n(a, b) = r_n(a, b) + R_n(a, n-b) + R_n(b, n-a) - p_n(b, n-b)\, p_{n-b}(a) - p_n(a, n-a)\, p_{n-a}(b),$$

where the first three quantities are as given in Theorem 4.5 (Cases 2, 3
and 5), while the last two terms are given by Lemmas 4.2 and 4.3.



Proof: Part (i) follows from Lemma 6.2 noting that $n = a + b$. For part (ii),
Lemma 6.1 implies that $A$ and $B$ are clans of $\mathcal{T}^{-\rho}$ precisely if one of the following
three events occurs:

(a) $A$ and $B$ are clades of $\mathcal{T}$;

(b) $A$ and $X - B$ are clades of $\mathcal{T}$, but $B$ is not a clade of $\mathcal{T}$;

(c) $B$ and $X - A$ are clades of $\mathcal{T}$, but $A$ is not a clade of $\mathcal{T}$.

(Note that $X - A$ and $X - B$ cannot both be clades of $\mathcal{T}$, by the compatibility
condition (1), since $(X - A) \cap (X - B) \ne \emptyset$ by the assumption that $a + b < n$,
and since $X - A$ neither contains nor is contained in $X - B$.) Moreover, the
three events (a), (b), (c) are mutually exclusive, by virtue of the assumption
that $A, B$ are disjoint and their union is a strict subset of $X$. The probability of
Event (a) is $r_n(a, b)$, while the probability of Event (b) is
$R_n(a, n-b) - p_n(b, n-b)\, p_{n-b}(a)$, since the first term is the probability that $A$ and
$X - B$ are clades of $\mathcal{T}$, and $p_n(b, n-b)\, p_{n-b}(a)$ is the probability that $A$, $X - B$
and $B$ are clades of $\mathcal{T}$. Similarly, $R_n(b, n-a) - p_n(a, n-a)\, p_{n-a}(b)$ is the
probability of Event (c). The result now follows by adding the probabilities of
these three mutually exclusive events. □
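The three-event decomposition in this proof translates directly into a short calculation. The sketch below is a self-contained illustration with hypothetical function names; it uses the closed forms for $R_n$ and $r_n$ that correspond to Cases 2 and 5 of Theorem 4.5 and reproduces the worked value $q_6(2,2) = 7/225$.

```python
from fractions import Fraction
from math import comb, factorial


def p1(n, a):
    """p_n(a), Lemma 4.2."""
    if not 1 <= a <= n - 1:
        return Fraction(0)
    return Fraction(2 * n, a * (a + 1)) / comb(n, a)


def p_split(n, a):
    """p_n(a, n - a), Lemma 4.3."""
    return Fraction(2, n - 1) / comb(n, a)


def R(n, a, b):
    """Nested clades: a set of size a inside a set of size b < n."""
    return Fraction(4 * n, a * (a + 1) * (b + 1)) / (comb(n, b) * comb(b, a))


def r(n, a, b):
    """Disjoint clades of sizes a and b with a + b < n."""
    s = a + b
    g = (Fraction(n, a * b * (a + 1) * (b + 1))
         - Fraction(a * (a + 1) + b * (b + 1) + a * b,
                    a * b * (a + 1) * (b + 1) * (s + 1))
         + Fraction(1, s * (s * s - 1)))
    return Fraction(4 * factorial(a) * factorial(b) * factorial(n - s),
                    factorial(n - 1)) * g


def two_clan_prob(n, a, b):
    """q_n(a, b) for disjoint sets with a + b < n: events (a) + (b) + (c)."""
    event_a = r(n, a, b)
    event_b = R(n, a, n - b) - p_split(n, b) * p1(n - b, a)
    event_c = R(n, b, n - a) - p_split(n, a) * p1(n - a, b)
    return event_a + event_b + event_c


print(two_clan_prob(6, 2, 2))  # 7/225
```

As a further check, two distinct single leaves are always clans, so the function should return 1 whenever a = b = 1.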



6.1. Extensions of the clan condition (I) 

For a pair $A, B$ of disjoint subsets of $X$ a weaker condition than requiring
that $A$ and $B$ are both clans of $\mathcal{T}^{-\rho}$ is simply to require that at least one edge
of this tree separates $A$ from $B$. Let $Q_n(A, B)$ be the probability of this event
for an unrooted YHK tree on the leaf set $X_n$. Then we have the following
result, which follows from the sampling consistency (SC) property applied in
the unrooted setting.

$$Q_n(A, B) = q_{a+b}(A), \tag{6}$$

where $q_{a+b}(A)$ is given by Theorem 6.3(i).

6.2. Extensions of the clan condition (II) 

We now describe a second extension. Suppose $A_1, A_2, \ldots, A_k$ partition $X$,
and, as usual, let $a_i = |A_i|$. For an unrooted YHK tree $\mathcal{T}$ let $q(a_1, \ldots, a_k)$ be
the probability that $A_1, A_2, \ldots, A_k$ are clans of $\mathcal{T}$ and let $q'(a_1, \ldots, a_k)$ be the
probability that $A_1, A_2, \ldots, A_k$ are convex on $\mathcal{T}$ (that is, the minimal subtree
connecting the leaves in $A_i$ is vertex disjoint from the minimal subtree connecting
the leaves in $A_j$ for all pairs $i \ne j$; see [15] for further details and the biological
significance of convexity).

We have calculated $q$ when $k = 2$ above (and $q' = q$ in this case). We turn
now to the next case of interest, $k = 3$, where, for example, we have:

$$q(2,2,2) = 1/75, \quad \text{and} \quad q'(2,2,2) = 1/15.$$

The following result provides exact formulae for these two quantities for
arbitrary $(a_1, a_2, a_3)$.

Theorem 6.4. Let $n = a_1 + a_2 + a_3$. Then:

(i) $$q(a_1, a_2, a_3) = \frac{4\, a_1!\, a_2!\, a_3!}{(n-1)!} \sum_{i=1}^{3} \frac{1}{(n - a_i)((n - a_i)^2 - 1)};$$

(ii) $q'(a_1, a_2, a_3) = q_n(a_1, a_2) + q_n(a_1, a_3) + q_n(a_2, a_3) - 2 q(a_1, a_2, a_3)$, where
$q_n(a_i, a_j)$ is given in Theorem 6.3(ii), and $q(a_1, a_2, a_3)$ is from part (i).

Proof: For part (i), the event that $A_1, A_2$ and $A_3$ (which partition $X$) are
clans of $\mathcal{T}^{-\rho}$ is the union of three disjoint events $E_{jk}$ over the three choices of
$\{j, k\} \in \{\{1, 2\}, \{1, 3\}, \{2, 3\}\}$, where $E_{jk}$ is the event that the union of two of
the sets - say $A_j$ and $A_k$ - is a clade of $\mathcal{T}$, and that this clade has maximal
clades $A_j$ and $A_k$. The exchangeability and group elimination conditions then
give:

$$q(a_1, a_2, a_3) = P(E_{12}) + P(E_{13}) + P(E_{23}) = \sum_{i=1}^{3} p_n(n - a_i) \cdot p_{a_j + a_k}(a_j, a_k),$$

where $\{i, j, k\} = \{1, 2, 3\}$ in the term on the right-hand side of this last
equation. By Lemmas 4.2 and 4.3 this gives:

$$q(a_1, a_2, a_3) = \sum_{i=1}^{3} \frac{2n}{(n - a_i)(n - a_i + 1)} \cdot \frac{(n - a_i)!\, a_i!}{n!} \cdot \frac{2}{n - a_i - 1} \cdot \frac{a_j!\, a_k!}{(n - a_i)!},$$

which simplifies to the expression given in (i).

For part (ii), the event that $A_1, A_2$ and $A_3$ are convex on $\mathcal{T}^{-\rho}$ is the
union of three (non-disjoint!) events $E'_{jk}$ over the three choices of $\{j, k\} \in
\{\{1, 2\}, \{1, 3\}, \{2, 3\}\}$, where $E'_{jk}$ is the event that two of the sets - say $A_j$ and
$A_k$ - are clans of $\mathcal{T}^{-\rho}$. Note that the intersection of any two (or three) of these
three events is simply the event that all three sets are clans of $\mathcal{T}^{-\rho}$, which was
dealt with in part (i). Thus, by the principle of inclusion and exclusion, we
have:

$$q'(a_1, a_2, a_3) = P(E'_{12}) + P(E'_{13}) + P(E'_{23}) - 2 q(a_1, a_2, a_3)$$

and the result in part (ii) now follows. □
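The simplification in part (i) and the inclusion-exclusion identity in part (ii) can both be checked on the example sizes (2, 2, 2). In this sketch the pairwise clan probabilities are passed in as numbers (here the value $q_6(2,2) = 7/225$ quoted earlier) rather than recomputed; all function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial


def p1(n, a):
    """p_n(a), Lemma 4.2."""
    if not 1 <= a <= n - 1:
        return Fraction(0)
    return Fraction(2 * n, a * (a + 1)) / comb(n, a)


def p_split(n, a):
    """p_n(a, n - a), Lemma 4.3."""
    return Fraction(2, n - 1) / comb(n, a)


def three_clans_closed(a1, a2, a3):
    """q(a1, a2, a3): closed form obtained by simplifying the sum in the proof."""
    n = a1 + a2 + a3
    scale = Fraction(4 * factorial(a1) * factorial(a2) * factorial(a3), factorial(n - 1))
    return scale * sum(Fraction(1, (n - a) * ((n - a) ** 2 - 1)) for a in (a1, a2, a3))


def three_clans_sum(a1, a2, a3):
    """The same probability as the sum over i of p_n(n - a_i) * p_{a_j + a_k}(a_j, a_k)."""
    sizes = (a1, a2, a3)
    n = sum(sizes)
    total = Fraction(0)
    for i in range(3):
        aj = sizes[(i + 1) % 3]
        total += p1(n, n - sizes[i]) * p_split(n - sizes[i], aj)
    return total


def convex_prob(pairwise_clan_probs, q3):
    """q'(a1, a2, a3) from the three pairwise clan probabilities and q(a1, a2, a3)."""
    return sum(pairwise_clan_probs) - 2 * q3


q3 = three_clans_closed(2, 2, 2)
print(q3)                                        # 1/75
print(convex_prob([Fraction(7, 225)] * 3, q3))   # 1/15
```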

Deriving explicit formulae for $q(a_1, \ldots, a_k)$ and $q'(a_1, \ldots, a_k)$ for $k > 3$ is,
in principle, possible but the formulae quickly become increasingly complex.

6.3. Extensions of the clan condition (III) 

A third extension is to consider the probability Qn{Ai, A2) that two sets 
Ai, A2 are clans of a YHK tree on n leaves when these two sets are not disjoint. 
For this setting we have the following result. 

Proposition 6.5. Suppose $A_1, A_2$ are non-disjoint subsets of $X$, and $a_i = |A_i|$.

(i) If $A_1 \subset A_2$, then:

$$Q_n(A_1, A_2) = q_n(a_1, n - a_2),$$

where $q_n(*, *)$ is given by Theorem 6.3. Similarly, if $A_2 \subset A_1$ then
$Q_n(A_1, A_2) = q_n(n - a_1, a_2)$.

(ii) Otherwise, if neither set $A_1, A_2$ is a subset of the other, then:

$$Q_n(A_1, A_2) = \begin{cases} q_n(a_1 - a_{12},\, a_2 - a_{12}), & \text{if } A_1 \cup A_2 = X; \\ 0, & \text{otherwise,} \end{cases}$$

where $a_{12} = |A_1 \cap A_2|$, and $q_n(*, *)$ is given by Theorem 6.3.

Proof: First observe that if $A_1 \subset A_2$ then $A_1$ and $A_2$ are clans of an unrooted
phylogenetic $X$-tree $T$ if and only if $A_1$ and $X - A_2$ are clans of $T$. Noting that
these are disjoint sets, the first part of Proposition 6.5 follows from Theorem
6.3. For the second case, where neither set $A_1, A_2$ is a subset of the other,
first observe that in order for $A_1$ and $A_2$ to be clans of the same unrooted
phylogenetic $X$-tree $T$ a necessary condition is that $A_1 \cup A_2 = X$. Moreover,
under this condition, $A_1$ and $A_2$ are clans of $T$ if and only if $A_1 - A_1 \cap A_2$
and $A_2 - A_1 \cap A_2$ are clans of $T$; as these are disjoint sets, the second part of
Proposition 6.5 follows from Theorem 6.3. □

7. Discussion 

The arguments we have used in our analysis have primarily relied on repeated
application of the properties of exchangeability (EP) and group elimination
(GE) (or its corollary, sampling consistency (SC)) for the YHK model, together
with Lemmas 4.2 and 4.3. However, other natural models for trees can also
satisfy some of these properties. Indeed the distribution that assigns each rooted
binary phylogenetic tree on $X_n$ the same probability (sometimes known as the
'Proportional to Distinguishable Arrangements', or PDA model) satisfies both
(EP) and (GE) [1]. This suggests that by finding and applying the results
corresponding to Lemmas 4.2 and 4.3 for the PDA model, one could develop a
parallel line of results for the PDA model to most of the analysis we have
provided in this paper for the YHK model.

Unfortunately only one other model, apart from PDA and YHK, is known
to satisfy both (EP) and (GE) and this model is not of biological interest, as it
only generates pectinate (comb-like) tree shapes. Aldous [1] has conjectured that
these are the only three distributions on rooted binary phylogenetic trees that
satisfy both (EP) and (GE). Nonetheless, it may be of interest to explore
models that satisfy weakened assumptions - for example, (EP) and (SC), or just
(EP).

Even with (EP) alone, one can devise meaningful statistical significance tests.
For example, suppose $N$ taxa include one or more particular (disjoint) subsets
(different 'types' of taxa) $A_1, A_2, \ldots, A_k$, where $k \ge 1$. Consider any model for
generating a rooted binary tree that satisfies the exchangeability property (EP),
and let $p_n$ be the probability that a tree on this set of taxa as leaves, generated
under this model, has at least one clade of size at least $n$ consisting of just one
type (i.e. all leaves in the clade are a subset of one of the sets $A_1, \ldots, A_k$). Then
we have the following result, the proof of which is given in the Appendix.
Proposition 7.1. For any probability distribution on rooted binary trees
satisfying the exchangeability property (EP), we have:

$$p_n \le \sum_{i=1}^{k} \sum_{m=n}^{a_i} \frac{N}{m} \binom{a_i}{m} \binom{N}{m}^{-1},$$

where $a_i = |A_i|$.

As a simple example, suppose we have $N = 40$ taxa, including two disjoint
groups, each containing six taxa. For a tree generated under any model that
satisfies the exchangeability property, the probability that this tree would
contain a clade of size four or larger consisting entirely of taxa from one of the
two groups is, at most:
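A bound of this kind is straightforward to evaluate. The sketch below assumes the bound takes the form that the Appendix argument yields, namely the sum over groups i and clade sizes m from n to a_i of (N/m) C(a_i, m) / C(N, m); the function name `clade_bound` is hypothetical, and the printed number is this sketch's own evaluation rather than a figure quoted from the text.

```python
from fractions import Fraction
from math import comb


def clade_bound(N, n, group_sizes):
    """Upper bound on the probability of a single-type clade of size >= n.

    Assumed form: sum over groups and over m = n..a_i of (N/m) * C(a_i, m) / C(N, m).
    """
    total = Fraction(0)
    for a in group_sizes:
        for m in range(n, a + 1):
            total += Fraction(N, m) * Fraction(comb(a, m), comb(N, m))
    return total


bound = clade_bound(N=40, n=4, group_sizes=[6, 6])
print(float(bound))  # roughly 0.0034
```

Because the bound comes from an expectation, it can exceed 1 for small n and is only informative when it is small.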
9. Appendix: Proof of Proposition 7.1

Let $X_{m,i}$ be the number of clades of size $m$ in the randomly-generated
tree that have the property that the taxa are all of type $A_i$, and let
$\mathcal{N} := \sum_{i=1}^{k} \sum_{m=n}^{a_i} X_{m,i}$. Then $p_n = P(\mathcal{N} > 0)$. Since $\mathcal{N}$ is a non-negative integer
random variable, we have:

$$P(\mathcal{N} > 0) \le E[\mathcal{N}]. \tag{7}$$

By linearity of expectation we have:

$$E[\mathcal{N}] = \sum_{i=1}^{k} \sum_{m=n}^{a_i} E[X_{m,i}]. \tag{8}$$

Moreover:

$$E[X_{m,i}] = \sum_{t} E[X_{m,i} \mid t]\, P(t), \tag{9}$$

where the summation is over all binary tree shapes on the given leaf set of
size $N$, $E[X_{m,i} \mid t]$ is the conditional expectation of $X_{m,i}$ given that $t$ is the tree
shape generated by the random speciation process, and $P(t)$ is the probability
of generating tree shape $t$. For any given tree shape $t$:

$$E[X_{m,i} \mid t] = \sum_{v :\, n_v = m} E[I_{v,i} \mid t], \tag{10}$$

where the summation is over all the interior vertices of $t$ for which the number
of leaves below $v$ ($n_v$) is $m$, and where $I_{v,i}$ is the binary random variable that
takes the value 1 precisely if all the leaves below $v$ are of type $A_i$, and 0
otherwise. Now, by exchangeability, we have the following identity for any
vertex $v$ of $t$ with $n_v = m$:

$$E[I_{v,i} \mid t] = P(I_{v,i} = 1 \mid t) = \binom{a_i}{m} \binom{N}{m}^{-1}. \tag{11}$$

Now any tree shape on $N$ leaves has, at most, $N/m$ vertices $v$ for which $n_v = m$,
and so we obtain, from (10) and (11), $E[X_{m,i} \mid t] \le \frac{N}{m} \binom{a_i}{m} \binom{N}{m}^{-1}$. Since this
inequality holds for all tree shapes $t$, Equation (9) implies that
$E[X_{m,i}] \le \frac{N}{m} \binom{a_i}{m} \binom{N}{m}^{-1}$. The expression for $p_n$ now follows from Equations (7) and (8).